34 research outputs found

    Comparing De Novo Genome Assembly: The Long and Short of It

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
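    The N50 metric named in the abstract above can be illustrated with a minimal sketch. It uses the usual definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half the total assembly length. The function name `n50` and the contig lengths are hypothetical and are not taken from the paper or from the AMOS tool.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (usual definition)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly length defines N50.
        if running >= half_total:
            return length


# Hypothetical assembly: six contigs, 290 units in total.
original = [80, 70, 50, 40, 30, 20]
# Same bases, but the two longest contigs are joined, whether or not
# the join is correct.
joined = [150, 50, 40, 30, 20]

print(n50(original))  # 70
print(n50(joined))    # 150
```

    Joining two contigs raises N50 from 70 to 150 even if the join is a misassembly, which is one way to see why a metric based only on size can miss contig quality and accuracy.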

    Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding

    We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping, along with shared synteny from closely related fish species, to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of the L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades, with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.

    A Single Molecule Scaffold for the Maize Genome

    About 85% of the maize genome consists of highly repetitive sequences that are interspersed with low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence build. For this purpose, we constructed a genome-wide, high-resolution optical map of the maize inbred line B73 genome containing >91,000 restriction sites (averaging 1 site/∼23 kb) accrued from mapping genomic DNA molecules. Our optical map comprises 66 contigs, averaging 31.88 Mb in size and spanning 91.5% (2,103.93 Mb/∼2,300 Mb) of the maize genome. A new algorithm was created that considered both optical map and unfinished BAC sequence data for placing 60/66 (2,032.42 Mb) optical map contigs onto the maize iMap. The alignment of optical maps against numerous data sources yielded comprehensive results that proved revealing and productive. For example, gaps were uncovered and characterized within the iMap, the FPC (fingerprinted contigs) map, and the chromosome-wide pseudomolecules. Such alignments also suggested amended placements of FPC contigs on the maize genetic map and proactively guided the assembly of chromosome-wide pseudomolecules, especially within complex genomic regions. Lastly, we think that the full integration of B73 optical maps with the maize iMap would greatly facilitate maize sequence finishing efforts that would make it a valuable reference for comparative studies among cereals, or other maize inbred lines and cultivars.

    Multiple Pathway-Based Genetic Variations Associated with Tobacco Related Multiple Primary Neoplasms

    BACKGROUND: In order to elucidate a combination of genetic alterations that drive tobacco carcinogenesis, we have explored a unique model system and analytical method for an unbiased qualitative and quantitative assessment of gene-gene and gene-environment interactions. The objective of this case control study was to assess genetic predisposition in a biologically enriched clinical model system of tobacco related cancers (TRC), occurring as Multiple Primary Neoplasms (MPN). METHODS: Genotyping of 21 candidate Single Nucleotide Polymorphisms (SNP) from major metabolic pathways was performed in a cohort of 151 MPN cases and 210 cancer-free controls. Statistical analysis using logistic regression and Multifactor Dimensionality Reduction (MDR) analysis was performed for studying higher order interactions among various SNPs and tobacco habit. RESULTS: Increased risk association was observed for patients with at least one TRC in the upper aero digestive tract (UADT) for variations in SULT1A1 Arg²¹³His, mEH Tyr¹¹³His, hOGG1 Ser³²⁶Cys, XRCC1 Arg²⁸⁰His and BRCA2 Asn³⁷²His. Gene-environment interactions were assessed using MDR analysis. The overall best model by MDR was tobacco habit/p53(Arg/Arg)/XRCC1(Arg³⁹⁹His)/mEH(Tyr¹¹³His), which had the highest Cross Validation Consistency (8.3) and test accuracy (0.69). This model also showed significant association using logistic regression analysis. CONCLUSION: This is the first Indian study on a multipathway based approach to study genetic susceptibility to cancer in tobacco associated MPN. This approach could assist in planning additional studies for comprehensive understanding of tobacco carcinogenesis.

    Global burden of disease due to smokeless tobacco consumption in adults : analysis of data from 113 countries

    BACKGROUND: Smokeless tobacco is consumed in most countries in the world. In view of its widespread use and increasing awareness of the associated risks, there is a need for a detailed assessment of its impact on health. We present the first global estimates of the burden of disease due to consumption of smokeless tobacco by adults. METHODS: The burden attributable to smokeless tobacco use in adults was estimated as a proportion of the disability-adjusted life-years (DALYs) lost and deaths reported in the 2010 Global Burden of Disease study. We used the comparative risk assessment method, which evaluates changes in population health that result from modifying a population's exposure to a risk factor. Population exposure was extrapolated from country-specific prevalence of smokeless tobacco consumption, and changes in population health were estimated using disease-specific risk estimates (relative risks/odds ratios) associated with it. Country-specific prevalence estimates were obtained through systematically searching for all relevant studies. Disease-specific risks were estimated by conducting systematic reviews and meta-analyses based on epidemiological studies. RESULTS: We found adult smokeless tobacco consumption figures for 115 countries and estimated burden of disease figures for 113 of these countries. Our estimates indicate that in 2010, smokeless tobacco use led to 1.7 million DALYs lost and 62,283 deaths due to cancers of mouth, pharynx and oesophagus and, based on data from the benchmark 52 country INTERHEART study, 4.7 million DALYs lost and 204,309 deaths from ischaemic heart disease. Over 85% of this burden was in South-East Asia. CONCLUSIONS: Smokeless tobacco results in considerable, potentially preventable, global morbidity and mortality from cancer; estimates in relation to ischaemic heart disease need to be interpreted with more caution, but nonetheless suggest that the likely burden of disease is also substantial. The World Health Organization needs to consider incorporating regulation of smokeless tobacco into its Framework Convention for Tobacco Control.
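    The abstract above describes deriving attributable burden from exposure prevalence and relative risk, but does not state the formula used. As a minimal sketch, assuming the standard population attributable fraction for a single exposure level (Levin's formula), the calculation might look as follows. The function names and all input values are hypothetical and are not figures from the study.

```python
def population_attributable_fraction(prevalence, relative_risk):
    """Levin's formula: PAF = p(RR - 1) / (1 + p(RR - 1)).

    Assumes one exposure level; the study may have used a more
    detailed formulation.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must lie between 0 and 1")
    if relative_risk <= 0.0:
        raise ValueError("relative risk must be positive")
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def attributable_burden(total_burden, prevalence, relative_risk):
    """Share of a total burden (e.g. DALYs or deaths) attributed to the exposure."""
    return total_burden * population_attributable_fraction(prevalence, relative_risk)


# Hypothetical inputs: 20% of adults exposed, relative risk of 2.0,
# and 1,000 deaths from the disease in the population.
paf = population_attributable_fraction(0.20, 2.0)
deaths = attributable_burden(1000, 0.20, 2.0)
print(f"PAF = {paf:.3f}, attributable deaths = {deaths:.1f}")
```

    With these illustrative inputs, about one sixth of the deaths would be attributed to the exposure; applying the same calculation per country and per disease and summing the results gives a global total.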

    Rickettsia Phylogenomics: Unwinding the Intricacies of Obligate Intracellular Life

    BACKGROUND: Completed genome sequences are rapidly increasing for Rickettsia, obligate intracellular alpha-proteobacteria responsible for various human diseases, including epidemic typhus and Rocky Mountain spotted fever. In light of phylogeny, the establishment of orthologous groups (OGs) of open reading frames (ORFs) will distinguish the core rickettsial genes and other group specific genes (class 1 OGs or C1OGs) from those distributed indiscriminately throughout the rickettsial tree (class 2 OGs or C2OGs). METHODOLOGY/PRINCIPAL FINDINGS: We present 1823 representative (no gene duplications) and 259 non-representative (at least one gene duplication) rickettsial OGs. While the highly reductive (approximately 1.2 Mb) Rickettsia genomes range in predicted ORFs from 872 to 1512, a core of 752 OGs was identified, depicting the essential Rickettsia genes. Unsurprisingly, this core lacks many metabolic genes, reflecting the dependence on host resources for growth and survival. Additionally, we bolster our recent reclassification of Rickettsia by identifying OGs that define the AG (ancestral group), TG (typhus group), TRG (transitional group), and SFG (spotted fever group) rickettsiae. OGs for insect-associated species, tick-associated species and species that harbor plasmids were also predicted. Through superimposition of all OGs over robust phylogeny estimation, we discern between C1OGs and C2OGs, the latter depicting genes either decaying from the conserved C1OGs or acquired laterally. Finally, scrutiny of non-representative OGs revealed high levels of split genes versus gene duplications, with both phenomena confounding gene orthology assignment. Interestingly, non-representative OGs, as well as OGs comprised of several gene families typically involved in microbial pathogenicity and/or the acquisition of virulence factors, fall predominantly within C2OG distributions. CONCLUSION/SIGNIFICANCE: Collectively, we determined the relative conservation and distribution of 14354 predicted ORFs from 10 rickettsial genomes across robust phylogeny estimation. The data, available at PATRIC (PathoSystems Resource Integration Center), provide novel information for unwinding the intricacies associated with Rickettsia pathogenesis, expanding the range of potential diagnostic, vaccine and therapeutic targets.